Prediction of Medical Health Expenses

I) Objective:

Primary Objective:

To predict future medical health expenses.

Secondary Objectives:

To check the relationship between sex and smoking status.
To identify the factors affecting medical expenses.
To understand the impact of gender, number of children, and region on medical expenses.

II) Problem Statement:

Everyone’s life revolves around their health, and good health is essential to all aspects of our lives. Health refers to a person’s ability to cope with the environment on a physical, emotional, mental, and social level. Because of the fast pace of our lives, we are adopting many habits that harm our health. People spend a lot of money to stay healthy, by taking part in physical activities or having frequent health check-ups, to avoid becoming unfit and to get rid of health disorders. When we become ill, we tend to spend a lot of money, resulting in high medical expenses. An application could therefore help people understand the factors that are making them unfit and driving up their medical expenses, and could estimate the medical expense of someone who has such factors.

III) Data Description:

The dataset contains 1338 rows and 7 columns: ‘age’, ‘sex’, ‘bmi’, ‘children’, ‘smoker’, ‘region’, and ‘expenses’. The ‘expenses’ column is the target column and the remaining six are independent columns, that is, the features used to predict the outcome.

Age: The first column is age. Age is an important factor for predicting medical expenses because young people are generally healthier than older people, so their medical expenses tend to be lower.

Sex: The next column is sex, which has two categories: male and female. The sex of a person can also play a role in predicting the medical expenses of a subject.

bmi: The third column is bmi, the Body Mass Index. For most adults, an ideal BMI is in the 18.5 to 24.9 range. For children and young people aged 2 to 18, the BMI calculation takes into account age and gender as well as height and weight. A BMI of less than 18.5 is considered underweight. People with a very low or very high BMI are more likely to require medical assistance, resulting in higher costs.

Children: The fourth column is the ‘children’ column, which records how many children each subject has. People who have children are under more pressure, because of their children’s education and other needs, than people who do not have children.

Smoker: The fifth is the ‘smoker’ column. Smoking is considered one of the most important factors, as people who smoke are at increased risk, particularly once they reach the age of 50 to 60.

Region: Next is the ‘region’ column. Some regions are hygienic, clean, and prosperous while others are not, and this affects health and therefore medical expenses.

Expenses: The last column is ‘expenses’, the target column. It holds the individual medical costs billed by health insurance.

Import Libraries

Read the Data

Inspect the insurance DataFrame with the info(), head(), and describe() methods.
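
The three steps above can be sketched as follows. This is only a minimal sketch, not the original notebook code: the file name `insurance.csv` and the helper `load_insurance` are assumptions, and the inline rows are made-up placeholder values that mimic the seven columns so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical placeholder rows (NOT taken from the real dataset); they only
# reproduce the seven column names described in the data description.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,expenses
25,female,22.5,0,no,northeast,3000.00
40,male,31.2,2,yes,southeast,25000.00
58,female,27.8,1,no,southwest,12000.00
33,male,35.0,3,no,northwest,6000.00
"""


def load_insurance(source):
    """Read the insurance data from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real file this would be: insurance = load_insurance("insurance.csv")
insurance = load_insurance(io.StringIO(SAMPLE_CSV))

insurance.info()             # column names, dtypes, and non-null counts
print(insurance.head())      # first rows of the table
print(insurance.describe())  # summary statistics of the numeric columns
```

On the real data, `info()` is the quickest way to confirm the 1338 rows and 7 columns and to spot missing values before any plotting.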

IV) EDA (Exploratory Data Analysis):

a) Univariate Data Analysis

* Distribution of Smoker, Children and Regions

Interpretation:

From the above diagram, 20.48% of the subjects in the smoker column are smokers and 79.52% are non-smokers.

The count plot shows that the number of children per subject ranges from 0 to 5, and that subjects with no children are the largest group.

We again used a pie chart to plot the number of inhabitants in the region column, which consists of four segments: Northeast, Northwest, Southeast, and Southwest. Southwest and Northwest have the same number of inhabitants, 325 each, while Northeast and Southeast have 324 and 364 respectively, which together account for all 1338 rows.
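
The smoker percentages quoted above can be cross-checked against the row count with a few lines of arithmetic. The sketch below uses only the two figures stated in this report (1338 rows, 20.48% smokers); the helper name `share` is an assumption introduced for illustration.

```python
TOTAL_ROWS = 1338      # dataset size from the data description
SMOKER_SHARE = 20.48   # percent of smokers, as read from the pie chart

# Recover the underlying counts from the percentage.
smokers = round(TOTAL_ROWS * SMOKER_SHARE / 100)
non_smokers = TOTAL_ROWS - smokers


def share(count, total):
    """Percentage share of `count` in `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


print(smokers, non_smokers)
print(share(smokers, TOTAL_ROWS), share(non_smokers, TOTAL_ROWS))
```

With a pandas DataFrame, the same shares would normally come from `value_counts(normalize=True)` on the column.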

Distribution Plots:

* Distribution of age, BMI and charges

Interpretation:

From the above diagram we conclude that the subjects are spread roughly evenly across all ages.

The BMI of the patients appears to be approximately normally distributed: most people have a BMI around 30, very few have a low BMI near 10, and similarly very few have a high BMI near 60. The charges distribution is right-skewed.

* Skewed distributed charges and Normal distributed charges

b) BIVARIATE DATA ANALYSIS:

Interpretation:

In the above diagram, medical expenses generally increase with age, although some older people have low medical expenses. The trend line in the figure indicates an approximately linear relationship between expenses and age.
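
A trend line like the one described above can be obtained with a degree-one least-squares fit. The sketch below uses synthetic data, since the real table is not reproduced here: the slope of 250, the intercept of 3000, and the noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the age and expenses columns (assumed relation).
age = rng.integers(18, 65, size=200)
expenses = 250.0 * age + 3000.0 + rng.normal(0, 4000, size=200)

# Fit expenses = slope * age + intercept by least squares.
slope, intercept = np.polyfit(age, expenses, deg=1)
trend = slope * age + intercept  # y-values of the trend line at each age

print(f"slope={slope:.1f}, intercept={intercept:.1f}")
```

A positive slope corresponds to the upward trend line in the scatter plot, and the scatter around it is why some older subjects still show low expenses.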

* We have plotted the scatterplot between BMI and charges.

Interpretation:

In the above diagram, most of the data lies below the trend line, indicating that people with low, medium, and high BMI can all have low expenses, which is an irregular pattern. The trend line itself, however, rises with BMI, so we can conclude that a high BMI is associated with higher expenses, but only in a minority of cases.

* We plotted the bar plot between the children and expenses

Impact of Smoking and Children on Expenses

Interpretation:

a) From the above figure, people with more children tend to have higher charges than those with no children or one child, since parents need to take care of the health of all of their children. Because very few people have more than 3 children, the charges for those groups are almost the same. People with 3 children have the highest charges.

b) The box plots in the above figure show the relationship between smoking and charges: charges are much higher for smokers than for non-smokers. This is expected, because smoking is injurious to health, so smokers are more likely than non-smokers to have health issues, which increases their medical charges.

c) MULTIVARIATE DATA ANALYSIS:

Interpretation:

The chart shows that BMI is not a strong factor, as people with a low BMI can also have high medical expenses, while people who smoke clearly tend to have high medical expenses. The size of each bubble indicates age, and expenses increase with bubble size, that is, with age.

V) Model Building:

Linear Regression:

Linear Regression was applied to predict future medical expenses for patients based on features such as age, gender, region, smoking behavior, and number of children. We computed R2 and RMSE values of 0.79 and 5673.09, which means that 79% of the variation in the target column, expenses, can be explained by the predictor variables. The RMSE indicates that a predicted expense typically differs from the actual expense by about 5673.
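
For reference, the two metrics quoted above can be computed directly from their definitions. This is a plain-Python sketch with made-up numbers, not the values from the study; in practice `sklearn.metrics.r2_score` and `mean_squared_error` give the same results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2_score(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, the share of variance that is explained."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted expenses, for illustration only.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

print(r2_score(actual, predicted))
print(rmse(actual, predicted))
```

RMSE is expressed in the units of the target, so it is only comparable between models that predict the target on the same scale.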

For two variables, sex and region, we obtained p-values greater than 0.05, which means they are statistically insignificant. We also ran a variance inflation factor (VIF) test on the independent variables to check for multicollinearity.
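
The variance inflation factor of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below computes it with NumPy on synthetic columns; the function name `vif` and the data are assumptions for illustration, and the study itself may have used a library routine instead.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of the other columns
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of `a`

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A VIF near 1 indicates little collinearity; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.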

Interpretation:

  1. The p-value for the F statistic is < 0.05 (we use 0.05 as the significance level for this project), so we can say that our model is significant: we reject the null hypothesis that all the regression coefficients are equal to zero, meaning that the regression coefficient of at least one independent variable is not equal to zero. In our case, the F statistic is 1120 and Prob(F-statistic) is much less than 0.05, so our model is significant.
  2. The R-squared value is 0.835, which means that 83.5% of the variation in the output can be explained by the independent variables of the model.
  3. If the adjusted R2 is much less than the R-squared value, there are variables in the dataset that are irrelevant to the model and do not affect the target variable. In our case, the R-squared value is 0.835 and the adjusted R-squared is 0.834, which suggests that our model has no irrelevant features.
  4. Now we check the p-value of each attribute: if the p-value is < 0.05, the attribute contributes to the model (we reject the null hypothesis coef = 0); if it is > 0.05, the attribute is insignificant (we fail to reject the null hypothesis coef = 0). In our model there is no feature with a p-value > 0.05, which means there is a linear relationship of the target variable with all the
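
The adjusted R-squared and the F statistic discussed in the points above both follow from R-squared, the number of observations n, and the number of predictors p. The sketch below applies the standard formulas; n = 1338 and p = 6 are assumptions (the report does not state the sample size or predictor count of this fit), chosen because they are consistent with the reported values.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def f_statistic(r2, n_samples, n_features):
    """Overall F = (R2 / p) / ((1 - R2) / (n - p - 1))."""
    return (r2 / n_features) / ((1.0 - r2) / (n_samples - n_features - 1))


R2 = 0.835  # reported R-squared
N = 1338    # assumed: all rows used in the fit
P = 6       # assumed: six predictors, no dummy expansion

print(round(adjusted_r2(R2, N, P), 3))
print(round(f_statistic(R2, N, P)))
```

Because n is large relative to p, the penalty term is tiny, which is why the adjusted value sits so close to the plain R-squared.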

Random Forest Regressor:

Random Forest is an ensemble learning method for classification and regression that constructs multiple decision trees at training time. For regression it outputs the average prediction of the individual trees, whereas for classification it outputs the class that is the mode of the classes predicted by the trees.

It is a powerful machine learning algorithm that works well in many cases.

First, the RandomForestRegressor class was imported from the sklearn.ensemble module so that we could use this model to predict the expenses.

After that, we instantiated a model using the RandomForestRegressor class.

With the model ready, we trained it on the training data using the fit function.

Here, the training data refers to x_train and y_train, where x_train holds the independent variables and y_train is the dependent (target) variable.

After the model was trained, we used it to make predictions.

To do that, we called the predict function, passing the independent variables, and saved the result produced by the Random Forest in a new variable so that we could compare the results later if required.

After building the predictive model, we evaluated it using performance metrics.

In this case, we checked the R2 score and the RMSE, as for the linear regressor. For the Random Forest model, the RMSE comes out to be 0.4188 and the R2 score comes out to be 0.789, from which we concluded that Random Forest works better than Linear Regression for this dataset.
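
The import, fit, predict, and evaluate steps described above can be sketched with scikit-learn as follows. The data here is synthetic (three assumed numeric features with an assumed relation to the target), and the hyperparameters and the variable name `y_pred_rf` are illustrative choices, so the scores it prints are not those of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for age, bmi, and an already encoded smoker flag.
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 2000, size=n)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(x_train, y_train)          # train on the training split
y_pred_rf = model.predict(x_test)    # keep predictions for later comparison

r2 = r2_score(y_test, y_pred_rf)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print(f"R2={r2:.3f}, RMSE={rmse:.1f}")
```

Fixing `random_state` in both the split and the model keeps the scores reproducible between runs.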

Gradient Boosting Regressor:

Random Forest is also an ensemble, but in Random Forest the trees are built independently, in parallel.

In Gradient Boosting, the ensembling happens sequentially: the first model’s errors are used to build the second model, the second model’s errors are used to build the third model, and so on. Models keep being added until the errors stop improving or a set number of models is reached.

This means that each model added by Gradient Boosting reduces the error left by the models before it.
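
The sequential idea described above can be illustrated with a deliberately small boosting loop in NumPy, in which each new one-split tree (a stump) is fitted to the errors of the ensemble built so far. This is a simplified sketch for squared error on a single synthetic feature, with hypothetical helper names, and is not how scikit-learn implements its regressor.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # drop the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each trained on the current errors."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    errors = []
    for _ in range(n_rounds):
        residual = y - prediction       # errors of the ensemble so far
        stump = fit_stump(x, residual)  # the next model learns those errors
        prediction += learning_rate * predict_stump(stump, x)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return prediction, errors


rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

_, errors = boost(x, y)
print(errors[0], errors[-1])
```

The training error never increases from one round to the next, which is the sense in which each added model corrects its predecessors; error on unseen data can still rise if too many rounds are used.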

First, the GradientBoostingRegressor class is imported from the sklearn.ensemble module.

After that, we created a base Gradient Boosting Regressor model and trained it with the fit function on the training data, x_train and y_train.

Here, x_train holds the independent variables and y_train is the dependent variable.

After the model was built, we predicted the target variable for our test data using the predict function and saved the result in a new variable so that we could compare the results later.

After that, we evaluated the model using the R2 score and RMSE, as for the previous two models, and obtained an RMSE of 0.2266 and an R2 score of 0.838.
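
A base Gradient Boosting Regressor, trained and evaluated as described above, might look like the following sketch. It runs on synthetic regression data from `make_regression` rather than the insurance table, and the variable names `gbr`, `y_pred_gb`, `r2_gb`, and `rmse_gb` are assumptions, so the printed scores are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with six features, standing in for the six predictors.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

gbr = GradientBoostingRegressor(random_state=0)  # base model, default settings
gbr.fit(x_train, y_train)
y_pred_gb = gbr.predict(x_test)                  # saved for later comparison

r2_gb = r2_score(y_test, y_pred_gb)
rmse_gb = float(np.sqrt(mean_squared_error(y_test, y_pred_gb)))
print(f"R2={r2_gb:.3f}, RMSE={rmse_gb:.2f}")
```

Evaluating on the held-out test split, rather than the training data, is what makes the R2 and RMSE comparable across the three models.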

Comparing Performance of three models:

We created a NumPy array of the R2 scores of all three models: Linear Regression, Random Forest, and Gradient Boosting.

An array of labels was also created so that these values could be compared in a bar chart. The bar plot, built with the seaborn library using a rainbow palette, shows the highest R2 score for Gradient Boosting and the lowest for Linear Regression.

This means that the Gradient Boosting model is the best choice and the Linear Regression model is the worst for this case. We have thus built our predictive models and compared them based on their accuracy and results. Below is the bar plot showing the performance of the three models.

VI) MAJOR FINDING:

We found that the most important factors for predicting the medical expenses of a subject are smoking behavior and age. Smoking is bad for health, as is already well known, and it increases medical expenses because smokers are more likely to fall ill than non-smokers.

We also found that expenses rise with age. As people get older, their health becomes more fragile and their immunity falls, so they are more likely to fall ill, go for more frequent medical check-ups, and adopt measures to stay healthy, such as taking medicines and engaging in physical activities like jogging, walking, and yoga, all of which increase medical expenses.

We built three models, among which the Gradient Boosting Regressor shows the best result: 83.2% of the variability of expenses can be explained by the predictor variables, and the model yields a comparatively low RMSE, so the expense predicted by this model will not vary too much from the actual expense.

VII) SCOPE AND LIMITATIONS OF THE STUDY:

The scope of the study is to predict future medical health expenses from certain features by building a robust machine learning model. Health forecasting is valuable for planning the future of health care in the various states, because front-line health delivery services and providers are usually not adequately informed and do not have adequate resources to meet a higher-than-normal demand for health care.